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PROTEIN THREADING BY LINEAR 
PROGRAMMING 

JINBO XU, MING LI 
December 31, 2002 

Protein three-dimensional structure prediction through threading ap- 
proach has been extensively studied and various models and algorithms have 
been proposed. In order to further explore ways to improve accuracy and 
efficiency of the threading process, this invention proposes a new method: 
protein threading via linear programming. Based on the contact map model 
of protein 3D structure, we formulate the protein threading problem as a 
large scale integer programming problem, then relax to a linear program- 
ming problem, and finally solve the integer program by a branch- and- bound 
method. The final solution is optimal with respect to energy functions incor- 
porating pairwise interaction and allowing variable gaps. Our formulation 
often allows the linear program give integral solutions. The algorithm has 
been implemented as software package RAPTOR - RApid Protein Thread- 
ing predictOR. Experimental results show that RAPTOR significantly out- 
performs other programs for fold recognition. The RAPTOR webserver is 
at http://www.cs.uwaterloo.ca/~j3xu/RAPTOR_form.htm 

1 Introduction 

The Human Genome Project has led to the identification of over 30 thou- 
sand genes in the human genome, which might encode, by some estimation, 
100,000 proteins as a result of alternative splicing. To fully understand the 
biological functions and functional mechanisms of these proteins, the knowl- 
edge of their 3-D structures is required. The ambitious Structural Genomics 
Initiatives, launched by NIH in 1999, intends to solve these protein struc- 
tures within the next ten years, through the development and application 
of significantly improved experimental and computational technologies. 

A protein structure is typically solved using x-ray crystallography or nu- 
clear magnetic resonance spectroscopy (NMR), which axe costly and very 
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time-consuming (ranging from months to yeaxs per structure) and is quite 
difficult for high- throughput production. The overall strategy of the NIH 
Structural Genomics Initiatives is to solve protein structures using experi- 
mental techniques like x-ray crystallography or NMR only for a small frac- 
tion of all the proteins and to employ computational techniques to model the 
structures for the rest of the proteins. The basic premise used here is that 
though there could be millions of proteins in nature, the number of unique 
structural folds is probably 2-3 (or even more) orders of magnitude smaller. 
Hence by strategically selecting the proteins with unique structural folds 
for experimental solutions, we can put the vast majority of other proteins 
"within the modeling distance" of these proteins. Model-based structure 
prediction techniques could play a significant role in helping to achieve the 
goal of the Structural Genomics Initiatives. Protein threading represents one 
of the most promising such techniques. 

Protein threading can be used for both structure prediction and protein 
fold recognition, i.e., detection of homologous proteins. Numerous com- 
puter algorithms have been proposed for protein structure prediction, based 
on the threading approach. Based on the energy function models and com- 
putational methods, they can be grouped into three classes: 

(1) The energy function does not include the pairwise interaction pref- 
erences explicitly. For this kind of model, a simple dynamic programming 
is employed to optimize the energy function. GenTHREADER[l] is a typ- 
ical example. The prediction speed is fast, but theoretically, the prediction 
accuracy is worse than those incorporating pairwise interactions. 

(2) The energy function includes the pairwise interaction preferences. 
However, it has been proved that this problem is NP-hard when variable gaps 
and pairwise interactions are considered simultaneously [2], Some kinds of 
approximation algorithms are used to optimize the energy function. These 
methods include double dynamic programming[3], frozen approximation^], 
and Monte Carlo sampling algorithm[5]. Unfortunately, T. Akutsu have 
proved that this problem is MAX-S NP-hard [6], which means that it cannot 
be approximated to arbitrary accuracy in polynomial time. 

(3) The energy function includes the pairwise interaction preferences and 
an exact algorithm is designed to optimize the energy function. Xu et al 
have proposed a divide-and-conquer method[7] which runs fast on simple 
protein template (interaction) topology but could take a long time for pro- 
teins with dense residue-residue interactions. In addition, this approach 
does not treat the following two special features explicitly: (i) interaction 
significance could be different from residues to residues; and (ii) interaction 
potentials could be heavily correlated with other non-pairwise scores such 
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as mutation scores and fitness scores. 

Our main focus in this paper is on the development of a globally op- 
timal and practically efficient threading algorithm based on the alignment 
model incorporating the pairwise interaction preferences explicitly and al- 
lowing variable gaps by using the integer programming approach. Integer 
programming formulation can fully exploit the abovementioned two special 
features of the pairwise interaction preferences. It allows us to use the exist- 
ing powerful linear programming packages together with some branch and 
bound algorithm to rapidly arrive at the optimal alignment, lb our knowl- 
edge, this is the first time that integer programming is applied to protein 
threading. 

2 Alignment Model 

We represent the amino acid sequence, of length m, of a protein template by 
t\t2 . ■ . tm and the query sequence, of length n, by s\S2 - . . s n . In formulating 
the protein threading problem, we follow a few basic assumptions widely 
adopted by the protein threading community [7]. We assume that: 

(1) Each template consists of a linear serieis of cores with the connecting 
loops between the adjacent cores. Each core is a conserved segment of 
an ohelix or /?-sheet secondary structure among the protein's homologs. 
Although the secondary structure is often conserved, insertion or deletion 
may occur within a secondary structure. So we only keep the most conserved 
part. Let c% = core(headi,taili) denote all cores of one template, where 
i = 1, 2, . . . ,M with M being the number of the cores, and 1 < head\ < 
tail\ < head2 < taih < . . - < head^ < tailj^ < m. The region between 
taili and headi+ i is a loop. The length of Ci is lerii = tail* — head* + 1. Let 
loci denote the stun of the length of all cores before c*, i.e., loci = lenj. 

(2) When aligning a query protein sequence with a template, alignment 
gaps axe confined to loops, i.e., the regions between cores or the two ends 
of the template. The biological justification is that cores are conserved so 
that the chance of insertion or deletion within them is very little. 

(3) We consider only interactions between core residues. It is generally 
believed that interactions involving loop residues can be ignored as their 
contribution to fold recognition is relatively insignificant. We say that an 
interaction exists between two residues if the spatial distance between their 
C Q atoms is within 7 A and they are at least 4 positions apart in the template 
sequence. We say that an interaction exists between two cores if there exists 
at least one residue-residue interaction between the two cores. 
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Our threading energy function consists of environment fitniess score E s , 
mutation score i? m , secondary structure compatibility score E ss . gap penalty 
Eg and pair wise interaction score E p . The overall energy function E has the 
following form: 

E = W m E rn 4- W S E S + W P E P + W g E g + W 5S £ JS5 

where W^, VK 5 , W^, W^, Vr ss are weight factors to be determined by training. 

Global alignment and global-local alignment methods are employed to 
align one template to one sequence. For the detailed description, please 
refer to Fischer's paper [8] . In the case that the template size is larger than 
the query sequence size, it is possible that some cores at the two ends of 
the template cannot be aligned to the sequence. But we can always extend 
the sequence by adding some "artificial" amino acids to the two ends of the 
sequence to make all cores are aligned to the (extended) sequence. All scores 
involving those extended positions are set to be 0. 

3 Formulation 

Definition 3.1 We use an undirected graph CMG = (V, E) to denote the 
contact map graph of a protein template structure. Here } V = {c\. C2, cm} 
where ci represents i th core, and E — {(ci,Cj)\there is an interaction between 
Ci and Cj, or \i — j\ = 1}. 

For simplicity, when we say that core is aligned to position sj, we al- 
ways mean that core Ci = {head^taili) is aligned to segment (sj,Sj+/ e >i<-i)« 
In order to speed up the search, RAPTOR employs some knowledge-based 
filtering process proposed in PROSPECT[7] that indicates certain align- 
ments as invalid. 

Definition 3.2 Let B denote the alignment bipartite graph of one threading 
pair. Each core of the template corresponds to one vertex in B, labeled as 
Ci(i = 1,2,... ,Af), each residue in the query sequence corresponds to one 
vertex in B } labeled as sj(j = l,2,...,n). The edges of B consist of all 
valid alignments (after initial filtering) between each core and each sequence 
position. The edges of B are also called the alignment edges. 

Definition 3.3 For any two different edges ei = (c^,.^,) ande 2 = (ci 2 ,$j 2 ) 
in an alignment bipartite graph B } if (toc^-focij x (s h +loci 2 -loc u -s h ) < 
0, then we say e\ and e<i are in conflict. 
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The proof of the following three lemmas is omitted due to space limit. 

Lemma 3.1 For any three different edges e T = {ci r ,Sj r ), r = 1,2,3 and 
loc^ < loa 2 < loci 3 , if e% conflicts with e 2 and e 2 conflicts with e$, then e\ 
conflicts with e$. 

Lemma 3.2 For any three different edges e r = {ci r ,Sj r ), r = 1,2,3 and 
loci v < loCi 2 < loCi Z} if e\ does not conflict with e 2 and e2 does not conflict 
with e$, then e\ does not conflict with e^. 

Lemma 3.3 For any three different edges e r = (ci r ,Sj T ) } r = 1,2,3 and 
locii < loCi 2 < loa 3 , if e\ conflicts with e^, then e 2 conflicts with e$ or e 2 
conflicts with e\ . 

Definition 3.4 An alignment is called a valid alignment if: (1) each core of 
the template is aligned to some position of the (extended) sequence 1 ; (2)For 
any two different cores c\ and c 2 , their two alignment edges do not conflict 
in the alignment graph. 

Let Xij be a boolean variable such that x^i = 1 if and only if core Ci is 
aligned to position s,. Similarly, for any (c^ , a 2 ) G E(CMG), let y^),^) 
indicate the pairwise interactions between x^^ and X{ 2y i 2 if the two edges 
(<*i>*d)i(ci2i*f2) do not conflict. ypxA),^^) = 1 if and only if x iuil = 1 
and x i2yh = 1. We say y^),^) is generated by x ilh and x iiih . 

The x variables are called the alignment variables and y variables are 
called the interaction variables. Let D[i] denote all valid query sequence 
positions that c» could be aligned to. Let R[iJJ] denote all valid alignment 
positions of Cj given Ci is aligned to S(. 

Now the objective function of the protein threading problem can be 
formulated as follows: 



min W m E m + W S E S + W p E p + W g E 9 + W ss E t 



(1) 



]C \. X U x ^2 Mutation(headi + rj + r)], 

i=l l€D[i] r=0 



(2) 



Es = Y2 l x *J x ]L Fit>ness(headi + rj + r)J, 



(3) 



As mentioned before, global and global-local alignment are employed. 
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M lew- 1 

^ = E E^M* E 55(^ ea d,+r,5^ r )], (4) 



E EE ymu*)P(iJJ,k), (5) 

L<i<i<Af,(c i>Ci )€£(CMG) i€0[i] *€K[i,jV] 
P(t\ j, Z, k) = £ E *('fce«<i+tti theadj+v)Pair(l + u,k + v) (6) 

^ = E E E 2/W),(i+i,^)G(i, i, *), (7) 

where t v ) = 1 if there is an interaction between residues at position 
u and v in the template, otherwise 0. G(i,Z, k) is the gap potential between 
c% and Ci+i when they are aligned to query sequence position I and k respec- 
tively. G{i,l,k) could be computed by dynamic programming in advance 
given k. 

The constraint set is as follows: 

£ = 1^ = 1,2,. ..,M; (8) 



E x ^ + E ^ Mo € t = 1,2,... ,M — 1; 

1>1 0 ,16D[H *e0[»+l]-/Z[t,t+Uo] 

_ (9) 

E »W).C**> ^ *«• W e ^W.*»i = 1,2,. . . ,M; (10) 

E »W).tt.*) ^ *M. v * S D\j], i,j = 1,2, ... , M; (11) 

J€«[7'M] 

E 1/W),(j.fc) ^ *W + E Xj,k-l,lED\i},i,j = 1.2,...,M; (12) 
fc€A[»J,i] k€R{ij,l] 

E 2/(i,0,(7,*) > + E ^ > l-l,*eJ?[?] ) i,i = l,2,...,M; (13) 

Xij > 0, j €£>[»],* = 1, 2,..., M; (14) 

J/(i,i)(i,*) > 0 S W € £>[»], k € DM, i, j = 1,2, .... M. (15) 
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Constraint 8 says that one core can be aligned to a unique sequence 
position. Constraint 9 forbids the conflicts between the adjacent two cores. 
Therefore, based on lemma 3.2, this constraint can guarantee that there are 
no conflicts between any two cores if variable x and y are integral. Con- 
straints 10 and 11 say that at most one interaction variable can be 1 be- 
tween any two cores that have interactions between each other. Constraints 
12 and 13 enforce that if two cores have their alignments to the sequence 
respectively and also have interactions between them, then at least one in- 
teraction variable should be 1. Constraints 8,14 and 15 guarantee x and y 
to be between 0 and 1 when this problem is relaxed to linear program. 

There is another set of more obvious constraints which can replace Con- 
straints 9-13. They are: 

Xij+Xi+ijb < l,Vfc 6 D[i 4- 1] - R[i,i + 1,/]; (16) 
2/(M)(i,*0 < x i,h & € h 11 fa, Cj) 6 fi(CMG); (17) 
VMU*) < ' ^ fltf. h *], (ci: cj) € E(CMG)\ (18) 

yam*) £ x u + *i* - l > fe« c j) e e{cmg); (19) 

Constraint 16 forbids the conflict between two neighboring cores and 
Constraints 17-19 guarantee that one interaction variable is 1 if and only if 
its two generating x variables are 1. Constraints 16-19 can be inferred from 
Constrains 9-13. Conversely, it is not true. Therefore, Constraints 16-19 are 
weaker than Constraints 9-13. 

In order to improve running time, we found yet another set of Constraints 
20 and 21 from which both 9-13 and 16-19 can be inferred (proofs omitted 
due to space limitation). 

£ VHm*) = *V> c >) e E{CMG)\ (20) 

£ y&0tf,fc) = c i) G E{CMG)\ (21) 

Constraints 20 and 21 imply that one x variable is 1 is equivalent to that 
one of the y variables generated by it is 1 . These two are the strongest 
constraints. Experimental results show that our algorithm with Constraints 
20 and 21 (combining with Constraints 8,14 and 15) runs significantly faster. 
Our program RAPTOR uses 20 and 21 by default. 
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4 RAPTOR - Implementation 

4.1 Scoring System 

We calculated the averaged energy over a set of homologous sequences, as 
demonstrated in PROSPECT-II[9]. Given a query sequence of length n, an 
n x 20 frequency matrix PSFM is calculated by using PSI-BL AST[10] with 
maximum iteration number being set to 5. Each column of this matrix de- 
scribes the occurring frequency of 20 amino acids at this position. Assume a 
template position i is aligned to the sequence position j. Then the mutation 
score and fitness score are calculated as follows. 

Mutational, j) = 5Zpj )a M(^,a) 

a 

Fitness(iJ) = ^Pj yQ F(env i} a) 

a 

where pj A represents the occurring frequency of amino acid a at sequence 
position j, M(a, 6) represents the mutation potential between two amino 
acid a and b which is taken from PAM250 matrix[ll], F(env. a) denote the 
fitness potential when amino acid a is placed into environment env. 

The 9 combinations of three secondary structure types (a-helix, /3-strand 
and coil) and three solvent accessibility levels are used to define the local 
environments of a position in the template. The boundaries between thre 
solvent accessibility levels are at 7% and 37%. Secondary structure and 
solvent accessibility assignments are all taken from FSSP database[12]. 

The gap penalty function is assumed to be an affine function, i.e., a 
gap open penalty plus a length-dependent gap extension penalty. Gap open 
penalty is set at 10.6 and gap elongation penalty is 0.8 per single gap[13]. We 
use^PSIPRED [14] to predict the secondary structure of the query sequence. 

If the two ends of an interaction are aligned to jf* and ffi positions of 
the| query sequence respectively, then the pair score for this interaction is 
giv|n by: 

f Fair{j u j 2 ) = Y,Ph>* 53ft"a^ P ( a » 6 ) 

where P(a,b) denotes the pairwise interaction potential between two amino 
acids a and b. F, P are taken from PROSPECT-II[9]. 

4.2 Branch-and-Bound Method 

We use a branch- and- bound algorithm to solve the above integer program- 
ming problem. First wc relax the above integer program by allowing all x 
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and y to be real between 0 and 1 and solve the resulting linear program. If 
the solution (x*, y*) of the linear program is integral, then we get the optimal 
solution. Otherwise, we select one non-integral variable according to some 
criterion, and generate two subproblems by setting it to 0 and 1 respectively. 
These two subproblems are solved recursively. More details on solving in- 
teger programming problem can be found in [15]. IBM OSL(Optimization 
and Solution Library) package is used to implement this process. 

4.3 Weight Training 

The weight factors W mi W s , W ss , W 9 , W p are chosen through optimizing the 
overall alignment accuracy. The optimal alignment accuracy does not nec- 
essarily imply the best fold recognition capability though. In the following 
subsection, an SVM (Support Vector Machine) method is used to carry out 
fold recognition. A set of 95 structurally-aligned protein pairs are chosen 
from Holm's test set[16] as the training samples, each of which only has 
fold-level similarity. The alignments generated by RAPTOR is compared 
with the structural alignments generated by SARF[17]. An alignment for a 
residue is regarded as correct if it is within 4 residue shift away from the cor- 
rect structure-structure alignment by SARF. The overall alignment accuracy 
is defined as the ratio between the number of the correctly-aligned positions 
of all threading pairs and the number of the maximum alignable positions. 
Our objective is to maximize the overall alignment accuracy. A genetic al- 
gorithm plus a local pattern search method implemented in DAKOTA[18] is 
used to search for the optimal weight factors. We attained 56% alignment 
accuracy over this set of training pairs. A set of 1100 protein pairs which 
axe in the fold-level similarity is also generated from Holm's test set[16] to 
test the weight factors and 50% alignment accuracy is attained. We have 
also selected 95 structurally- aligned protein pairs from Holm's test set, each 
of which is in superfamily-level or family-level similarity. 80% alignment 
accuracy is achieved when the same set of weight factors is used. 

4.4 z-score and Fold Recognition 

After threading one pair of sequence and template, z-score is calculated 
according to the method proposed in paper[19] to cancel out the composition 
bias. Let z Taw denote this kind of z-score. However, since the accurate z raw 
is expensive to compute, we just approximate it by (i) fixing the alignment 
positions; (ii) shuffling the query sequence randomly; and (iii) calculating the 
alignment scores based on the existing alignment rather than doing optimal 
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alignments again and again. The free software SVM light[20] with RBF 
kernel is employed to adjust the approximate z-score. Due to space limit, we 
refer the reader o Vapnik's book[21] for a comprehensive tutorial of SVM. A 
set of 60000 training pairs formed by all-against-all threading between 300 
templates (randomly chosen from the FSSP database) and 200 sequences 
(randomly chosen from Holm's test set [16]) is used as the training samples 
of our SVM model. The relationship between two proteins is judged based 
on SCOP database[22]. If one pair is in at least fold-level similarity, then 
it is treated as a positive example, otherwise a negative example. Each of 
training samples consists of the following features: (l)z raw ; (2) the sequence 
size; (3) the template size; (4) the number of cores in the template; (5) 
the sum of the core size in the template; (6) the number of aligned cores; 
(7) the number of aligned positions; (8) the number of identical residues; 
(9) the number of contacts with both ends on the aligned cores: (10) the 
number of cut contacts with one end on the aligned cores and the other on 
the unaligned cores; (11) the total score; (12) mutation score; (13) singleton 
fitness score; (14) gap score; (15) secondary score; (16) pair score. Given 
one threading result, SVM outputs a real value. The value greater than 
0 means this threading pair is in at least fold-level similarity. We do not 
use tins directly due the abundance of the false negatives. We calculate the 
final z-score for each query sequence. For all threading pairs of one given 
sequence, let 01,02,.. .,o v denote the outputs from SVM model. The final 
z-score is calculate by , where u{6) is the mean value of and std{o) 

is the standard deviation of 0*. Daniel Fischer's benchmark[8] is used to fix 
the parameters of the model. 

5 Preliminary Experimental Results 

Fischer's benchmark consists of 68 target sequences and 301 templates. 
RAPTOR ranks 56 pairs out of 68 pairs as top 1, achieving 82% predic- 
tion rate, while the previous best was 76.5%. 

The fold recognition performance of RAPTOR was further tested on Lin- 
dahl's benchmark set consisting of 976 protein sequences [23]. By threading 
them all against all, there are totally 976 x 975 pairs. We measure RAP- 
TOR'S performance in three different similarity levels: fold, superfamily and 
family. The results are shown in Table 1. The results of other methods are 
taken from Shi et aFs paper [24]. 

As shown in Table 1, the performance of RAPTOR at the fold level 
is much better than the others. At the superfamily level, RAPTOR per- 
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method 


Family 

Top 1 Top 5 


Superfamily 
Top 1 Top 5 


Fold 
Top 1 Top 5 


JrCAr* A XJtL 


75.2 


77.8 


39.3 


50.0 


25.4 


45.1 


RAPTOR-no 


68.0 


72.8 


34.0 


49.7 


19.0 


36.6 


FUGUE 


82.2 


85.8 


41.9 


53.2 


12.5 


26.8 


PSI-BLAST 


71.2 


72.3 


27.4 


27.9 


4.0 


4.7 


HMMER-PSIBLAST 


67.7 


73.5 


20.7 


31.3 


4.4 


14.6 


SAMT98-PSIBLAST 


70.1 


75.4 


28.3 


38.9 


3.4 


18.7 


BLASTLINK 


74.6 


78.9 


29.3 


40.6 


6.9 


16.5 


SSEARCH 


68.6 


75.7 


20.7 


32.5 


5.6 


15.6 


THREADER 


49.2 


58.9 


10.8 


24.7 


14.6 


37.7 



Table 1: The performance of RAPTOR at three different similarity levels 

forms a little bit worse than FUGUE [24], the best method (for superfamily 
and family level) listed in this table. However, at the family level, RAP- 
TOR performs better than only THREADER, which means that RAPTOR 
is superior in recognizing fold-level similarity but bad in doing homology 
detection. RAPTOR-np is a variant of RAPTOR without considering pair- 
wise interactions when doing optimal alignment, but the pairwise score is 
still calculated based on the non-pairwise alignment. The corresponding 
weight factors and SVM model are optimized separately using the same sets 
of training samples. Compared with RAPTOR-np, RAPTOR is better in 
fold level and superfamily level and same in family level. Thus, we may 
conclude that a strict treatment of the pairwise interactions is necessary for 
fold level recognition or even superfamily level. 

6 Computing Efficiency Issues 

An outstanding advantage of our algorithm is that the memory requirement 
is just about 0(M 2 n 2 ) and, at most of time, the computing time does not 
increase exponentially with respect to the sequence size. Figure 1 shows 
the CPU time of threading 100 sequences (chosen randomly from LindahTs 
benchmark) with size ranging from 25 to 572 to a typical template 1191 of 
length 162 (with topological complexity [7] 3 and 12 cores). According to 
Xu et aUs paper[7], the computing time of PROSPECT is 0(Mn 5 ) and 
its memory usage is 0(Mn 4 ). The observed memory usage of RAPTOR is 
100 ~ 200M for most of threading pairs. Figure 1 shows that the computing 
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time of our algorithm increases very slowly with respect to the sequence 
size. In fact, we found out that our relaxed linear programming gave the 
integral solutions most of time or generated only a few branch nodes when 
the solution was not integral. 



° 08 

oo°o o°° 



8. 



o «^o" c 



300 400 
Size (Ms) 



Figure 1: computing time of threading 100 sequences to template 1191. 



7 Improving LP Efficiency 

Protein threading is a very effective method to do fold recognition. Profile- 
based method runs very fast but could only recognize easy targets (Homol- 
ogy Modelling targets). Pairwise contact potential plays an important role 
in recognizing hard targets (Fold Recognition targets). However, if pair- 
wise contacts are considered and variable gaps are allowed, then protein 
threading problem is NP-hard. Therefore, an efficient and exact algorithm 
is very indispensable for protein threading in order to do fold-level recog- 
nition. PROSPECT makes use of a clever divide-and-conquer algorithm to 
search for the globally optimal alignment between templates and sequences. 
Its computational complexity depends on the topological complexity of the 
template contact graph. It is very efficient for those templates whose con- 
tact graphs have low topological complexity. PROSPECT could effectively 
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exploit the sparseness of contact graphs. If the distance cutoff is 7 A, there 
are about 3 quarters of contact graphes which has low complexity (< 4). 
However, PROSPECT couldn't deal with those templates with high topo- 
logical complexity. The integer programming method could be used to deal 
with those templates with high complexity by exploiting the relationship 
between various kinds of scores contained in the energy function. However, 
for a long sequence and a complex contact graph, there are often tens of mil- 
lions of variables involved in the integer program. Even for a topologically 
complex contact graph, there is often some regular part contained in this 
graph, i.e., this graph might have a subgraph which is topologically simple. 
Integer programming approach could not automatically take advantage of 
the partial regularity of the contact graph. We could use divide-and-conquer 
algorithm to align this kind of regular subgraph to the sequence first. In 
this paper, we would like to combine the integer programming approach 
and the divide-and-conquer algorithm. Before employing the powerful inte- 
ger programming approach to formulate the problem, we first compress the 
contact graph to generate a dense contact graph by some graph reduction 
operation. During the process of reduction, simple dynamic programming 
and divide-and-conquer algorithm could be used to align one segment to the 
sequence. 

7.1 Hyper Graph-Based Integer Program Formulation 

In this section, we will present a hypergraph-based integer program formu- 
lation which could be regarded as the generalization of the integer program 
formulation mentioned in the above sections. This kind of formulation could 
be used in two aspects. First, if the threading energy function takes the 
multiple-wise rather than pairwise contact potential into account, then the 
template contact graph becomes the hypergraph. We need to deal with a 
hypergraph based threading problem directly. Second, even if only the pair- 
wise contact potential is considered, after graph reduction operation which 
would be discussed in details in the following sections, the reduced contact 
graph could become a hypergraph. In order to take advantage of the integer 
programming approach, we need to generate our simple graph based integer 
program formulation to the hypergraph-based formulation. 

Definition 7.1 We use an undirected contact hypergraph HCMG = (V,HE) 
to model the contact map graph of a protein template structure , V = 
{ci,C2, .»»cm}, where denotes the i th core and HE = {e = (c^.c^, ...,Cj fc )| 
there is one multiple-wise interaction among c% x ,Ci 2 , ...,Cj fc }. 
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The only difference between the hypergraph-based threading problem 
and the simple graph based threading problem is that the energy func- 
tion of hypergraph-based threading consists of the multiple-wise interaction 
preferences. If there is a multiple-wise interaction among *i ,f 21—1**1 Let 
P(s\i 52) —1 4*1*1.1*2! ■■•!**) denote the multiple-wise interaction score when 
aligning Si to template position then the objective of hypergraph-based 
threading is to minimize £^(si,*i) + E-P(*L;*2i--<i**:*i)*2) •••>**)- 

For any e = fenCs,,...,^) 6 HE, let VeMM^h indicate the multiple- 
wise interaction among x variable Xi u x x A^^^ '^i*,^* VcMM J* = ^ ^ an ^ 

only if core Ci T is aligned to sequence position l T (r = 1,2, ...,&). Then we 
could formulate the objective function of the hypergraph based threading 
as follows. 

Min J^x iriLr F(l r7 i r ) + 53»WiA..-Wfl 

X,I 2 ,...,ijb,C 

Where F(J r ,i r ) denote the single- wise score when core z r is aligned to se- 
quence position l T and P(fi,/2> — ,/fe,e) the multiple-wise interaction score 
when e = (ci t , Cj 2 , Ci fc ) and core Cf r is aligned to sequence position Z r . 

Let D[i] denote the set of sequence positions that core C{ could be aligned 
to and R[i, j, I] the set of sequence positions that core Cj could be aligned to 
given that core Ci is aligned to sequence position /. 

The constraint set is as follows: 

E *y = 1 ( 22 ) 

For any e = (c^c^, ...,CtJ € VZ r € Z>[i r ], we have 

'p€/e[«r,Wr] 

€{0,1} 
*<A €{0,1} 

7.2 Contact Graph Reduction 

For a long protein template and a long query sequence, the number of the 
integer variables is often very huge, which will take the integer program a 
very long time to find the optimal solution from the search space of ex- 
ponential size. Before applying the integer programming approach to the 
threading problem, we try to decrease the number of the integer variables 



(23) 

(24) 
(25) 
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d c2 c3 C4 c5 c6 d c8 c9 ClO cll 



Figure 2: Template Contact Graph. 




cl c4 c8 c9 clO cll 



Figure 3: Compressed Contact Graph. 

by employing graph reduction technique onto the template contact graph. 
This te chni que is effective because many template contact graphs are sparse 
or contain sparse subgraphs. 

7.2*1 Graph Reduction Operation 

If there are two cores C£,Ca(« < k) such that Vj(t < jf < fc), N(cj) € 
{cijCi+^.-.jCjfc} and the subgraph formed by Ci+i,Ci+2, ».,Cjb-i is a graph 
with low topological complexity such as a nested graph, then we can first 
aUgn segment (ci,c*) to the sequence by some algorithm with low compu- 
tational complexity such as divide-and-conquer or dynamic programming 
and then treat the whole segment (c*, c&) as one edge when formulating 
the integer program. As shown in the original figure 2 and the resultant 
figure 3, segment (ci,C4) and segment (c 4 ,ca) are reduced to two edges re- 
spectively cause the subgraphs formed by ci,C2,C3,c 4 and C4, C5, C6,C7,C8 are 
topologically simple. Segment (ci,c 4 ) could be aligned to the sequence by 
simple dynamic programming algorithm. Segment (c4,cs) could be aligned 
to the sequence by divide-and-conquer algorithm within low degree poly- 
nomial time. This kind of graph reduction operation will be formulated as 
follows. 
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Definition 7.2 Given a template contact graph CMG = (V,E), \V\ > 2, a 
subset RV = {a x ,Ci 2 ,...,Ci k } (i x < i 2 ,--.,< ik,k >2)ofV is called Reduced 
Vertex Set if We = (q 5 c p ) e E (I < p), either at least one of c h c p is in RV 
or there exists one j such that ij < I < p < 

Definition 7.3 Given a template contact graph CMG = (V(CMG) J E{CMG)) } 
and its Reduced Vertex Set RV, its Reduced Contact Graph RCMG could 
be constructed as follows: 
(i) V{RCMG) = RV; 

(iijWvi,v 2 e V{RCMG), if{v u v 2 ) £ E{CMG) } then add(v u v 2 ) to E{RCMG); 
(in) For any segment (c^c^J, let Ci^,^,...,^ denote all cores in RV 
which are adjacent to at least one core within segment (ci r ,Ci r+l ), we add 
one hyperedge {ci r) c ir+l} Ci h ,Ci h) .. n c ijp ) into E(RCMG). 

According to Definition 7.2 and Definition 7.3, we can see that for any 
given template contact graph, there are often many possible reduced contact 
grapbes(see figure). Therefore, which one is the best in terms of computa- 
tional complexity? Let RV = {c h} Ci 2 , ...,c ip } denote the reduced vertex 
set of the contact graph after graph reduction operation. Let T seg denote 
the computational complexity of all segments (ct r ,Ci r+1 ),(r = 1,2, when 
r = p, let r + 1 = 1) and T re d uce d denote the computational complexity of 
the integer programming approach on the reduced contact graph. Ideally, 
the best graph reduction operation should minimize the maximum of T seg 
and T re duced- It : s easy to theoretically analyze how much time it will take for 
aligning all segments (see the next subsection). However, it's hard to esti- 
mate the computational complexity of the integer programming algorithm on 
the reduced graph though there is a trivial prediction that T Teduced = 0(nP) 
(n is the sequence size.). In practice, T reduced is far smaller than 0(n p ). 
There are two factors which are related to the computational time of the 
integer program. One is the size of the reduced vertex set, and the other 
is the complexity of the hyperedge, i.e., the number of cores contained in 
one hyperedge. Real protein data shows that a good reduction is to malce 
T 3 eg = 0(Mn*), fc = 3,4,5 (M is the number of cores) and limit the com- 
plexity of the hyperedge to at most 3. For A; = 3 or the hyperedge complexity 
less than 3, the reduced contact graph is still a simple contact graph. 

8 Summary 

What is claimed is: 

1. A method of protein threading by linear programming. 
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2. Three formulations of optimizing energy functions by linear program- 
ming. 

3. A method for improving the linear programming formulation of the 
protein threading problem by graph reduction. 

4. A method of protein threading by linear programming such that the 
solutions axe more likely to be integral. 

5. The method of any one of claims 1-4, further comprising the step of 
calculating a threading energy function. 

6. The method of claim 5 further comprising the step of exploiting the 
relationship between various kinds of scores contained in the energy function. 

7. The method of any one of claims 1 - 4, further comprising the step of 
performing a support vector machine (SVM) algorithm. 

8. The method of claim 2, comprising the prior step of generating a 
dense contact graph. 

9. A method of alignment comprising the steps of: formulating the 
protein threading problem as a large scale integer programming problem; 
relaxing this problem to a linear programming problem; and solving the 
integer program by a branch-and-bound method. 

10. An apparatus for executing the method of any one of claims 1-9. 

11. A system executing the method of any one of claims 1-9. 

12. A computer readable memory medium storing software code exe- 
cutable to perform the method of any one of claims 1 - 9." 

i 
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